ggplot2 figures for your manuscriptFor this short workshop we will use two datasets and create two separate plots, a bar chart with error bars and a depth profile plot. From this we should learn how to create these plots with ggplot2, learn that datasets need to be manipulated for different visualizations, and learn how to save the plots to files for inclusion into a manuscript.
(These instructions inspired by the R Graphics Cookbook)
Prior to jumping into coding any figure I like to think a bit about what I want to show and will actually sketch that out on paper. Doing this also forces you to start thinking about what data you need to create that figure. In the case of the bar chart with error bars at a minimum we will need some mean value to plot and some representation of error for the error bars. Another common thing to do is plot this for multiple categories and for multiple values.
For this example, we will be interested in looking at Nitrogen and Phosphorus concentrations across ecoregions. Let’s first start with the data.
Assuming we still have access to it, we will use the NLA 2012 water quality data for the bar chart with error bars. Specifically we will be interested in total nitrogen, total phosphorus, and ecoregion. First, let’s load up our packages:
library(dplyr) # For some basic data massaging
library(tidyr) # Also for some basic data massaging
library(ggplot2) # For the plots
Next, to get the data:
nla12_wq_url <- "https://www.epa.gov/sites/production/files/2016-12/nla2012_waterchem_wide.csv"
nla12_site_url <- "https://www.epa.gov/sites/production/files/2016-12/nla2012_wide_siteinfo_08232016.csv"
# Read, join, filter, select data
nla12_wq <- read.csv(nla12_wq_url, stringsAsFactors = FALSE)
nla12_site <- read.csv(nla12_site_url, stringsAsFactors = FALSE)
nla12 <- full_join(nla12_wq, nla12_site) %>% filter(VISIT_NO == 1) %>% select(SITE_ID,
FW_ECO9, NTL_RESULT, PTL_RESULT) %>% na.omit()
## Joining, by = "UID"
tbl_df(nla12)
## # A tibble: 1,130 × 4
## SITE_ID FW_ECO9 NTL_RESULT PTL_RESULT
## * <chr> <chr> <dbl> <dbl>
## 1 NLA12_MS-116 CPL 0.396 22
## 2 NLA12_MS-R03 CPL 1.226 66
## 3 NLA12_MS-119 CPL 0.990 45
## 4 NLA12_VT-101 NAP 0.261 18
## 5 NLA12_TX-141 CPL 0.445 30
## 6 NLA12_TX-169 CPL 4.065 54
## 7 NLA12_OR-133 WMT 0.160 20
## 8 NLA12_OR-131 WMT 0.625 197
## 9 NLA12_TX-158 CPL 0.806 88
## 10 NLA12_ID-107 XER 0.895 383
## # ... with 1,120 more rows
So now we have total nutrients and a categorical representing the ecoregion for the data. Chances are that the data won’t be in the form that we need to actually build the plot.
Remember, for this example we are going to plot nutrients vs. ecoregions, thus, we will need to summarize the per lake data on an ecoregional basis and get the mean and standard error for each nutrient within each ecoregion. Also, many of the ggplot2 functions will easily create seperate plots if a catergorical factor is supplied. We will keep this in mind becuase we want different bars for each of the variables.
Given this we need a data frame that looks like:
| Ecoregion | Nutrient | Mean | Standard Error |
|---|
nla12_bar_mean <- nla12 %>% group_by(FW_ECO9) %>% summarize(nitrogen = mean(log1p(NTL_RESULT)),
phosphorus = mean(log1p(PTL_RESULT))) %>% gather("variable", "mean", 2:3)
nla12_bar_se <- nla12 %>% group_by(FW_ECO9) %>% summarize(nitrogen = sd(log1p(NTL_RESULT))/sqrt(length(NTL_RESULT)),
phosphorus = sd(log1p(PTL_RESULT))/sqrt(length(PTL_RESULT))) %>% gather("variable",
"std_error", 2:3)
nla12_bar_data <- full_join(nla12_bar_mean, nla12_bar_se)
## Joining, by = c("FW_ECO9", "variable")
nla12_bar_data
## # A tibble: 18 × 4
## FW_ECO9 variable mean std_error
## <chr> <chr> <dbl> <dbl>
## 1 CPL nitrogen 0.6822705 0.03554801
## 2 NAP nitrogen 0.3253554 0.01715347
## 3 NPL nitrogen 1.1103478 0.06460114
## 4 SAP nitrogen 0.4071744 0.02794591
## 5 SPL nitrogen 0.9404627 0.06754431
## 6 TPL nitrogen 0.8761736 0.03329906
## 7 UMW nitrogen 0.5806731 0.02239086
## 8 WMT nitrogen 0.2937832 0.01983497
## 9 XER nitrogen 0.4977342 0.03703743
## 10 CPL phosphorus 4.1397335 0.08531323
## 11 NAP phosphorus 2.8811745 0.07004673
## 12 NPL phosphorus 5.0297885 0.13739033
## 13 SAP phosphorus 3.4282515 0.09277573
## 14 SPL phosphorus 4.6488980 0.13819771
## 15 TPL phosphorus 4.4939698 0.09297225
## 16 UMW phosphorus 3.4542994 0.05562500
## 17 WMT phosphorus 3.3918517 0.07033869
## 18 XER phosphorus 4.1589367 0.11952722
Now that was easy! I am being sarcastic, because getting the data ready is really about 90% of the effort. In reality this usually takes me a couple of iterations of getting things wrong. Now we can use this to build out our plot.
nla12_bar <- ggplot(nla12_bar_data,aes(x = FW_ECO9, y = mean, fill = variable)) +
geom_bar(stat = "identity", position = position_dodge()) +
geom_errorbar(aes(ymin=mean-std_error, ymax=mean+std_error),
width=.2, # Width of the error bars
position=position_dodge(.9))
nla12_bar
So that looks pretty good, but their might be a lot of other change you might want to see with this figure. We will consider three:
I find bar charts to be fairly difficult to read accurately so getting order of the ecoregions is not easy. One way to get at that is to order the x-axis based on one of the variables. For this we will re-order the axis, in descending order of mean Phosphorus. ggplot2 uses the order of a factor to do this. And while we could do this with base R, that’d make our heads hurt, so Hadley to the rescue with the forcats package and som magrittr kung fu.
library(forcats)
# First create a character vector of levels in the proper order
eco9_ord <- nla12_bar_data %>% filter(variable == "phosphorus") %>% arrange(desc(mean)) %>%
.$FW_ECO9
nla12_bar_data <- nla12_bar_data %>% mutate(desc_ecoregion = fct_relevel(factor(FW_ECO9,
eco9_ord)))
Now with that done, we can recreate our plot from above
nla12_bar <- ggplot(nla12_bar_data,aes(x = desc_ecoregion, y = mean, fill = variable)) +
geom_bar(stat = "identity", position = position_dodge()) +
geom_errorbar(aes(ymin=mean-std_error, ymax=mean+std_error),
width=.2, # Width of the error bars
position=position_dodge(.9))
nla12_bar
And lastly maybe I want phosphorus first. This is a bit more straightforward.
nla12_bar <- ggplot(nla12_bar_data,aes(x = desc_ecoregion, y = mean, fill = fct_relevel(factor(variable),"phosphorus" ,"nitrogen")))+
geom_bar(stat = "identity", position = position_dodge()) +
geom_errorbar(aes(ymin=mean-std_error, ymax=mean+std_error),
width=.2, # Width of the error bars
position=position_dodge(.9))
nla12_bar
Believe it or not, there is a fair bit of research behind which colors we should use for plots that aid in interpretation, are readable by those with colorblindness, etc. So the defualt is probably good while we build the plot, but we almost always want to move beyond that. For this we will use the viridis package
library(viridis)
nla12_bar <- nla12_bar + scale_fill_viridis(discrete = TRUE)
nla12_bar
Lastly, the default theme is fine, but we probably want to tweak it some. In particular for this plot let’s change up the background and fix our legend title.
First, let’s fix the legend title.
nla12_bar <- nla12_bar + guides(fill = guide_legend(title = "Nutrients"))
nla12_bar
Now let’s work on a different look and feel.
nla12_bar <- nla12_bar + labs(x = "Mean Concentration", y = "Ecoregions") +
theme(text = element_text(family = "serif"), panel.background = element_blank(),
panel.grid = element_blank(), panel.border = element_rect(fill = NA),
plot.title = element_text(family = "sans", size = 12, face = "bold",
vjust = 1.1), legend.position = c(0.85, 0.85), legend.key = element_rect(fill = "white"),
legend.text = element_text(family = "sans", size = 15), legend.title = element_text(family = "sans",
size = 11), axis.title.x = element_text(family = "sans", vjust = -0.5,
size = 12), axis.title.y = element_text(family = "sans", vjust = 1.5,
size = 12), axis.text.x = element_text(family = "sans", size = 11),
axis.text.y = element_text(family = "sans", size = 11))
nla12_bar
This shows the way to do that with a bunch of custom settings. For a quicker version of doing this, you can use some of the canned themes in ggplot2 or use the ggthemes package for many additional ones. Some example sof using one of the ggplot2 ones is below. Also note, I am not saving these to an object so result is a temporary view of what the plot would have looked like.
nla12_bar + theme_bw()
nla12_bar + theme_classic()
nla12_bar + theme_minimal()
Once you have the details of your figure, figured (he, he) out you need to move it from the screen and into your manuscript. This requires saving the output to a file. The ggplot2 package comes with a function to facilitate this, ggsave(). To output a ggplot2 object to a high resolution tiff:
ggsave(filename = "nla_bar_chart.tiff", plot = nla12_bar, path = "meetings",
width = 8, height = 4, units = "in", dpi = 300)
## Warning in grDevices::dev.off(): No TIFF support in this version of R
This should get you pretty close to providing the figures required by the journal you are submitting too. If there are additional things you need to do your figure you can either edit the file directly in an image processing program (e.g. gimp or irfanview) or you can manipulate the file in R with the magick package, essentially an R client for ImageMagick. Using magick is a bit beyond the scope of what we want to do today, but I will show a quick example of something I had to do for a paper recently: remlibmagick++-devove white space around borders of the image. We can do with this using the auto crop functionality in magick.
library(magick)
nla_fig <- image_read("nla_bar_chart.tiff")
nla_fig <- image_trim(nla_fig)
image_write(nla_fig, "nla_bar_chart_trim.tiff")
The original tiff: